CS4132 Data Analytics
American crosswords are word puzzles in which the goal is to fill in all the white squares with letters that fit the given clues. The distinguishing rules of American crosswords are that every white square must be part of both an across and a down answer (each square is used twice) and that each word must be at least 3 letters long. The answers are a mix of trivia and common phrases. There are two main criticisms of American crosswords.
The first criticism concerns accessibility. Sometimes, obscure terms must be used to fill the grid, as the constructor cannot find a better configuration. These obscure terms are called "crosswordese". They appear frequently in puzzles because they have convenient letter patterns. However, as computers have evolved to assist humans in construction, the quality of puzzles has been steadily improving.
The second criticism concerns representation. Crosswords are said to reflect a piece of the constructor's personality. Originally, crosswords catered mainly to straight, liberally-educated white men. As time passed, people realised that other groups were under-represented in both answers and clue-writing. Hence, there has been a push to bring more of these people into the crossword world, including mentorship for women, people of colour and LGBTQ people to construct crosswords.
This project aims to find out how accessible the crossword currently is, given the rise of computers as a construction aid. In a similar fashion, it also aims to find out how representative the crossword is of minorities.
Accessibility:
1. "Crosswordese"
How has the number of obscure answers changed throughout the years? Crosswordese is the use of an obscure word with a convenient letter pattern, often with many common letters or vowels, to fill in the grid. These words make puzzles hard for solvers who are not part of the "in-group" that knows all these common crossword words. Hence, I would like to find out how much crosswordese the crossword has contained over the years.
2. Freshness
How has the "freshness" factor of the crossword changed over the years? Crosswords are a reflection of the world and its current trends. With more and more crosswords in the pool, the number of never-before-seen terms and names in the crossword has steadily decreased. However, words and phrases are coined every single day, some catching on in modern language. This question aims to investigate that, coupled with how computers have given constructors more liberty in filling the grid.
Representation:
3. Inclusive Clues
How has clue-writing changed over the years? Clue-writing is half of the puzzle, and it can introduce unwanted stereotypes. For example, clues for the answer MIT have been associated with males more than females, reflecting a bias that men are more prominent in tech. Hence, by counting the mentions of minorities in clue-writing, one can find out how progressive the puzzle has become.
4. Constructors
How has the make-up of constructors changed over the years, and how has it affected the quality of crosswords? As noted, a crossword reflects a person's experiences and views, so a more diverse pool of constructors brings more variety. There was a time when constructors were mostly men, skewing the puzzles for some solvers. With the rising number of mentorships offered to minorities by prolific constructors, however, there has been an uptick in minority constructors. This question analyses the trend as a whole, and where possible the impact of mentorship.
import aiohttp
import asyncio
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
Each dataset is imported and displayed in a DataFrame below, together with a brief overview of the data it contains and the meaning of the columns relevant to the project.
Many of these datasets were scraped from the internet. The scraping code can be found in the Appendix.
Clue - the hint given for the answer in this crossword.
Answer - the expected answer to the given clue.
Outlet - where the crossword was published.
Date - when the crossword was published.
nyt=pd.read_csv("New York Times.csv")
lat=pd.read_csv("L.A. Times Daily.csv")
uni=pd.read_csv("Universal.csv")
usat=pd.read_csv("USA Today.csv")
wsj=pd.read_csv("Wall Street Journal.csv")
nyt.head()
| Clues | Answers | Outlet | Date | |
|---|---|---|---|---|
| 0 | 'High priority!' | RUSH | New York Times | Jan 27 1997 |
| 1 | 'We're number ___!' | ONE | New York Times | Jan 27 1997 |
| 2 | '___ Blue?' (1929 #1 hit) | AMI | New York Times | Jan 27 1997 |
| 3 | A.F.L.'s partner | CIO | New York Times | Jan 27 1997 |
| 4 | Adjusts to fit | ADAPTS | New York Times | Jan 27 1997 |
This dataset is obtained from XWordInfo, a site with extensive information on New York Times crosswords. The dataset is named NYTCI, standing for New York Times Constructor Info, and has 8 columns. The first two are the Day of Week and Date. Some crosswords are collaborations of up to 3 people, hence C1, C2 and C3, which stand for Constructors 1, 2 and 3 respectively. When a crossword has fewer than 3 constructors, the unused columns are dashed. C1, C2 and C3 No. give how many puzzles that constructor has published to date, and C1, C2 and C3 Gender give the constructors' genders.
xwi=pd.read_csv("NYTCI.csv")
xwi.head()
| Day | Date | C1 No. | C1 Gender | C2 No. | C2 Gender | C3 No. | C3 Gender | |
|---|---|---|---|---|---|---|---|---|
| 0 | Saturday | January 1, 1994 | puzzle # 12 | Mr | - | - | - | - |
| 1 | Sunday | January 2, 1994 | puzzle # 5 | Mr | - | - | - | - |
| 2 | Monday | January 3, 1994 | puzzle # 155 | Mr | - | - | - | - |
| 3 | Tuesday | January 4, 1994 | the debut puzzle | Mr | - | - | - | - |
| 4 | Wednesday | January 5, 1994 | puzzle # 13 | Mr | - | - | - | - |
Taking some of the most common answers across all crosswords, around 15K of them, we pass them to Google NGram as a URL, which returns scrapable data. We then take the years relevant to our crosswords, 1990-2020, and insert the frequencies into a DataFrame. Sometimes the data is missing; for example, I found no data for "ISNT". Since such gaps are rare and show no discernible pattern, it is reasonable to treat them as random and ignore them. There are three different files, as the answers were scraped in different sessions. The resulting DataFrame is displayed.
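As an illustration of the query step (the actual scraping code is in the Appendix), a batch URL could be built as below. The JSON endpoint and parameter names are assumptions based on the public Ngram Viewer, not taken from the project code:

```python
from urllib.parse import urlencode

# Hypothetical sketch: build one Ngram query URL for a batch of answers.
# Endpoint and parameter names are assumed from the public Ngram Viewer.
def ngram_url(words, year_start=1990, year_end=2020):
    params = {
        "content": ",".join(words),  # words are comma-separated
        "year_start": year_start,
        "year_end": year_end,
        "corpus": 26,                # an assumed English corpus id
        "smoothing": 0,
    }
    return "https://books.google.com/ngrams/json?" + urlencode(params)

print(ngram_url(["ERA", "AREA", "ORE"]))
```

Batching many answers into one `content` parameter keeps the number of requests (and sessions) manageable.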
ngram=pd.read_csv("CommonAnswers.csv")
ngram2=pd.read_csv("CommonAnswers2.csv")
ngram3=pd.read_csv("CommonAnswers3.csv") #scraping from different sessions
ngram.set_index("Year",inplace=True)
ngram2.set_index("Year",inplace=True)
ngram3.set_index("Year",inplace=True)
ngram=pd.merge(left=ngram,right=ngram2,left_index=True,right_index=True)
ngram=pd.merge(left=ngram,right=ngram3,left_index=True,right_index=True)
ngram.head()
| ERA | AREA | ORE | ALOE | ERIE | ONE | ERE | ARIA | ALE | ATE | ... | itunes | angler | fracas | exs | strands | lender | antihero | suedes | esker | petrol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | |||||||||||||||||||||
| 1990 | 0.000003 | 0.000007 | 4.092760e-07 | 3.317504e-08 | 2.779331e-07 | 0.000009 | 1.515894e-07 | 6.222778e-08 | 4.411864e-07 | 4.286382e-07 | ... | 3.051807e-11 | 6.585536e-07 | 1.608488e-07 | 1.244890e-07 | 0.000004 | 0.000008 | 5.290083e-08 | 7.561240e-09 | 8.216753e-08 | 0.000002 |
| 1991 | 0.000002 | 0.000007 | 4.053728e-07 | 3.388514e-08 | 2.709312e-07 | 0.000009 | 1.486782e-07 | 6.383107e-08 | 4.381159e-07 | 4.241332e-07 | ... | 3.259970e-11 | 6.519289e-07 | 1.605182e-07 | 1.120769e-07 | 0.000004 | 0.000008 | 5.585587e-08 | 7.518122e-09 | 7.954383e-08 | 0.000002 |
| 1992 | 0.000002 | 0.000007 | 3.924445e-07 | 3.322866e-08 | 2.769078e-07 | 0.000009 | 1.500604e-07 | 6.771065e-08 | 4.270695e-07 | 4.216174e-07 | ... | 3.421498e-11 | 6.445361e-07 | 1.613821e-07 | 1.089869e-07 | 0.000004 | 0.000008 | 5.705645e-08 | 7.448963e-09 | 7.593409e-08 | 0.000002 |
| 1993 | 0.000002 | 0.000007 | 3.799626e-07 | 3.290970e-08 | 2.682965e-07 | 0.000009 | 1.496317e-07 | 6.641682e-08 | 4.092451e-07 | 4.072851e-07 | ... | 3.041332e-11 | 6.360911e-07 | 1.625221e-07 | 1.054417e-07 | 0.000004 | 0.000008 | 5.713023e-08 | 7.465821e-09 | 7.266345e-08 | 0.000002 |
| 1994 | 0.000002 | 0.000007 | 3.657307e-07 | 3.326257e-08 | 2.661885e-07 | 0.000008 | 1.495085e-07 | 6.875681e-08 | 4.011461e-07 | 4.004407e-07 | ... | 3.178341e-11 | 6.346831e-07 | 1.645288e-07 | 9.975008e-08 | 0.000004 | 0.000008 | 5.877106e-08 | 7.244199e-09 | 7.050995e-08 | 0.000002 |
5 rows × 25153 columns
In crossword construction, wordlists are used. These wordlists are fed into a program, which suggests the best configuration for a particular section, or even for the whole grid. As such, wordlists aim to be as comprehensive as possible, maximising the number of configurations from which the best one can be picked by human eyes. These wordlists are usually scored by their author as well, rating how good, in their opinion, each answer is. Using these wordlists, we can also check how "good" each crossword is, giving a quantifiable weight to each answer.
Of course, these wordlists are biased towards whoever makes them. Hence, we shall use two independent wordlists to cross-check. Of the many wordlists out there, these two were chosen because they are very comprehensive but also free.
listA=pd.read_csv("peter-broda-wordlist__scored.txt",delimiter=";",header=None,dtype={'Column 1':int})
listA.columns=["Answer","Score"]
listA.set_index("Answer",inplace=True)
listA.head()
| Score | |
|---|---|
| Answer | |
| STY | 80 |
| SIGN | 85 |
| TRIO | 50 |
| YARD | 85 |
| DRAYS | 50 |
listB=pd.read_csv("spreadthewordlist_caps.txt",delimiter=";",header=None,dtype={'Column 1':int})
listB.columns=["Answer","Score"]
listB.set_index("Answer",inplace=True)
listB.head()
| Score | |
|---|---|
| Answer | |
| AAA | 50 |
| AAAA | 40 |
| AAAAAAAAAAAAAAA | 30 |
| AAAAAH | 20 |
| AAAADDRESS | 20 |
This dataset was acquired by simply visiting the website and copy-pasting. It will be used in Q3 to look for occurrences of names, purely as a lookup set rather than a DataFrame.
gNames=pd.read_csv("girl_names.txt",header=None)
gNames=set(gNames[0])
bNames=pd.read_csv("boy_names.txt",header=None)
bNames=set(bNames[0])
Since most of the data is scraped, I have been able to control the cleanliness of the data; therefore, its quality and cleanliness are high. Of course, there were some hitches during data collection, but missing data is rare, if present at all.
For two of the datasets, CrosswordGiant and NGram, there was the possibility of the data not existing. This was handled by catching the exception/error raised when processing the empty data, ensuring no invalid data is entered into the saved file. Again, the code is found in the Appendix.
For the namelist, no cleaning is required; that has already been done by the publisher. The symmetry-checking dataset is very simple and acquired by scraping. Although some of its entries are incorrect (caused by a logic error I did not have the skills to fix), they occur at random and should not affect the results significantly, so no cleaning is required there either.
For this dataset in particular, some webpages contain garbage data, with answers that are just XXXXXXXX or that are duplicated many times. The former is harder to detect and will be cleaned later, when answering question 1. The latter can easily be removed by counting how many entries a particular day's crossword has and removing the excess.
Sometimes, publications fill their crosswords with puns: bogus words that only make sense within a theme. These gimmicks are hard to detect, and unfortunately CrosswordGiant does not flag such cases. The problem is difficult to solve, as it is a linguistic one and not within the scope of this project. Some context is needed here: bogus words follow a theme, and themed crosswords appear only on certain days of the week. By and large, it is reasonable to assume that bogus words appear at random and are independent between crosswords.
Hence, these will be the steps for cleaning this dataset.
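One of those checks, spotting garbage answers made of a single repeated letter (e.g. XXXXXXXX), could be sketched as follows; `is_garbage` is a hypothetical helper, not the project's actual cleaning code:

```python
import pandas as pd

# Hypothetical heuristic: an answer of 4+ copies of one letter is likely
# filler such as XXXXXXXX. Note it also flags rare legitimate answers
# like AAAA, so flagged rows still deserve a manual look.
def is_garbage(answer: str) -> bool:
    return len(answer) >= 4 and len(set(answer)) == 1

answers = pd.Series(["XXXXXXXX", "BLOCKBUSTER", "AAAA"])
print(answers.apply(is_garbage).tolist())  # [True, False, True]
```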
Then, we just merge them.
While doing the project, I found that this dataset was missing some New York Times crosswords from around 2000. This missing data is not crucial to the project and can hence be ignored.
nyt["Day"]=-1
lat["Day"]=-1
usat["Day"]=-1
uni["Day"]=-1
wsj["Day"]=-1 #create a new column, set to -1 as the day is not known yet
def dayOfDate(row):
    months=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
    m,d,y=row.Date.split(" ")
    ts=pd.Timestamp(year=int(y),month=months.index(m)+1,day=int(d))
    row.Day=ts.dayofweek
    row.Date=ts
    return row
nyt=nyt.apply(dayOfDate,axis=1)
lat=lat.apply(dayOfDate,axis=1)
wsj=wsj.apply(dayOfDate,axis=1)
usat=usat.apply(dayOfDate,axis=1)
uni=uni.apply(dayOfDate,axis=1) #this code takes long to run, due to the huge dataset
#that's why the result has also been saved to a new file
crosswords=pd.concat([nyt,lat,uni,usat,wsj])
crosswords.drop_duplicates(inplace=True)
crosswords.to_csv("Clean XWs.csv")
crosswords=pd.read_csv("Clean XWs.csv",index_col=0)
crosswords.drop_duplicates(inplace=True)
badEntries=crosswords.groupby(["Outlet","Date"])[["Answers"]].count()
badEntries.reset_index(inplace=True)
badEntries=badEntries[badEntries.Answers>150] #crosswords with more than 150 answers are incorrect entries; they are dropped below
badEntries.head()
| Outlet | Date | Answers | |
|---|---|---|---|
| 5654 | New York Times | 1997-03-09 | 170 |
| 5738 | New York Times | 1997-06-01 | 162 |
| 5808 | New York Times | 1997-08-10 | 170 |
| 5948 | New York Times | 1997-12-28 | 171 |
| 5969 | New York Times | 1998-01-18 | 164 |
crosswords["Date"]=pd.to_datetime(crosswords["Date"])
crosswords["Year"]=crosswords["Date"].dt.strftime("%Y")
crosswords["Year"]=crosswords["Year"].astype(int)
for i,row in badEntries.iterrows():
    crosswords=crosswords[(crosswords.Outlet!=row.Outlet) | (crosswords.Date!=row.Date)]
#remove bad crossword entries
Curiously, their formatting is sometimes irregular. The numbering system calls the first puzzle "the debut puzzle" and the others "puzzle # n"; for the puzzle number, this is an easy fix. Constructors are called Mr or Ms depending on their gender, which is also easy to replace. The harder part was the inconsistencies in the data: for some people, their name was used instead of Mr X or Ms X. Since those are very few, I cleaned them by hand.
One person's gender appears as "A", an artifact of how the scraping was done. Visiting the website, I found the person's name and confirmed that they are male.
xwi.replace("puzzle # ","",regex=True,inplace=True)
xwi.replace("the debut puzzle","1",inplace=True)
xwi.replace("Mr","M",inplace=True)
xwi.replace("Ms","F",inplace=True)
xwi.replace("Jakob Weisblat","M",inplace=True)
xwi.replace("Pao Roy","M",inplace=True)
xwi.replace("Emet Ozar","F",inplace=True)
xwi.replace("A","M",inplace=True) # A. Tariq
#convert the puzzle number and their gender
xwi=xwi[~xwi["Date"].str.contains("is")]
xwi["Date"]=pd.to_datetime(xwi["Date"])
#keep only the valid entries; some rows were never filled in, since the scraping method was to overshoot the date
The interesting thing about Google NGram is that it returns different results depending on the capitalisation of the word. Hence, I tried both the all-caps and lower-case forms of each word, which yielded a DataFrame with two columns for the same word. This is easy to clean: we just need to add the two columns together, and the function used conveniently yields a sorted column order.
However, DataFrames make a poor lookup table, so I have chosen to convert them into dictionaries.
ngram.columns=[x.upper() for x in ngram.columns]
ngram=ngram.groupby(lambda x:x, axis=1).sum()
ngram.to_csv("Clean NGram.csv")
ngram.head()
| AAA | AAAS | AAH | AAHED | AAHS | AANDE | AAR | AARE | AARGH | AARON | ... | ZONES | ZONK | ZOO | ZOOM | ZOOS | ZOOT | ZORRO | ZOWIE | ZSA | ZULU | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | |||||||||||||||||||||
| 1990 | 0.000003 | 4.776738e-07 | 8.037317e-08 | 1.080322e-08 | 2.915548e-08 | 3.935635e-10 | 7.202219e-07 | 1.310561e-08 | 1.340437e-09 | 2.204129e-07 | ... | 0.000022 | 3.301622e-09 | 0.000002 | 0.000001 | 6.868482e-07 | 5.572532e-08 | 1.697215e-08 | 2.356272e-09 | 2.278728e-09 | 3.799622e-08 |
| 1991 | 0.000003 | 4.676968e-07 | 8.048573e-08 | 1.121164e-08 | 3.127718e-08 | 4.146365e-10 | 7.203304e-07 | 1.271489e-08 | 1.632048e-09 | 2.189614e-07 | ... | 0.000022 | 3.280857e-09 | 0.000002 | 0.000001 | 6.796661e-07 | 5.810294e-08 | 1.706612e-08 | 2.296641e-09 | 2.178643e-09 | 3.731547e-08 |
| 1992 | 0.000003 | 4.477974e-07 | 8.223795e-08 | 1.160830e-08 | 3.238063e-08 | 3.912581e-10 | 7.150338e-07 | 1.272449e-08 | 1.558918e-09 | 2.113534e-07 | ... | 0.000022 | 3.240615e-09 | 0.000002 | 0.000001 | 6.747638e-07 | 8.753073e-08 | 1.745364e-08 | 2.390806e-09 | 2.133923e-09 | 3.728858e-08 |
| 1993 | 0.000003 | 4.544412e-07 | 8.328433e-08 | 1.200256e-08 | 3.395920e-08 | 4.073982e-10 | 7.179115e-07 | 1.231911e-08 | 1.673835e-09 | 2.096335e-07 | ... | 0.000021 | 3.327646e-09 | 0.000002 | 0.000001 | 6.718674e-07 | 8.726372e-08 | 1.813236e-08 | 2.373550e-09 | 2.035918e-09 | 3.957867e-08 |
| 1994 | 0.000003 | 4.548627e-07 | 8.400814e-08 | 1.221837e-08 | 3.563201e-08 | 4.240070e-10 | 7.143078e-07 | 1.225623e-08 | 1.638794e-09 | 2.121875e-07 | ... | 0.000021 | 3.436024e-09 | 0.000002 | 0.000001 | 6.640938e-07 | 8.800427e-08 | 1.838808e-08 | 2.405292e-09 | 1.947023e-09 | 3.881467e-08 |
5 rows × 15571 columns
ngramDict=ngram.to_dict()
ngramDict["AAA"][2000]
3.226014294971885e-06
Although data cleaning is sparse in this project, this is compensated by the large amount of data transformation in the EDA, a consequence of the data being scraped and of concrete research into this niche area being rather lacking.
Firstly, let us define "short answers" as anything with at most 7 letters, and "long answers" as anything with at least 8 letters. For each crossword, we will score each answer and aggregate the scores per puzzle.
We will obtain a DataFrame with the columns Day, Date, Outlet, Score, AScore and BScore. This is our base data for graphing, saved in "Q1 Data.csv". Using the score, we can estimate how much crosswordese a puzzle contains: the higher the score, the better the puzzle. Then, we plot to observe any trends. There is a limitation to this method: some words are not found on the list, so I am unable to score them properly and instead give them a baseline score of 0 (a negligible 1e-3 for NGram). This problem appears more with the NGram dataset, as it has fewer entries and is less extensive. However, it still provides a reasonably good picture, as all crosswords should be affected similarly by missing entries.
def scoreNGram(row):
    year=row.Date.year
    if year>2019: year=2019 #no ngram data after 2019
    try:
        row.Score=ngram[row.Answers][year]*10000
    except KeyError: #this word is not within ngram
        row.Score=1e-3
    return row
dictA=listA.to_dict()
dictB=listB.to_dict()
def scoreWordlistA(row):
    try:
        row.AScore=dictA["Score"][row.Answers]
    except KeyError:
        row.AScore=0
    return row
def scoreWordlistB(row):
    try:
        row.BScore=dictB["Score"][row.Answers]
    except KeyError:
        row.BScore=0
    return row
#just lookup using the wordlists
short=crosswords[crosswords.Answers.str.len()<=7].copy() #copy the slice to avoid SettingWithCopyWarning
short["AScore"]=0
short["BScore"]=0
short["Score"]=0
short=short.apply(scoreWordlistA,axis=1)
short=short.apply(scoreWordlistB,axis=1)
short=short.apply(scoreNGram,axis=1) #score it against the wordlist
short=short.groupby(["Outlet","Date","Day"])[["Score","AScore","BScore"]].agg(["sum","count"])
short.columns = [''.join(col) for col in short.columns]
short.drop(columns=["Scorecount","AScorecount"],inplace=True)
short.rename(columns={"Scoresum":"Score","AScoresum":"AScore","BScoresum":"BScore","BScorecount":"Count"},inplace=True)
short=short.reset_index() #sum it up
short["AAScore"]=short["AScore"]/short["Count"]
short["BBScore"]=short["BScore"]/short["Count"]
short["NNScore"]=short["Score"]/short["Count"] #find mean of the score
short.Date=pd.to_datetime(short.Date)
short["Year"]=short["Date"].dt.strftime("%Y")
short["Year"]=short["Year"].astype(int)
short["Month"]=short["Date"].dt.strftime("%m")
short["Month"]=short["Month"].astype(int) #extract info about the date
short.to_csv("ShortScore.csv") #save the info
#this block of code transforms and forms the underlying dataset for this question
sns.stripplot(x="Day",y="AAScore",data=short)
plt.title("Score using wordlist A against the day of week")
plt.ylim((40,80))
plt.ylabel("Score")
pass
In the above graph, there does not seem to be much correlation between the day of week and how much crosswordese is present when we test it with wordlist A.
sns.stripplot(x="Day",y="NNScore",data=short)
plt.title("Short score using NGram against the day of week")
plt.ylabel("Short score")
Text(0, 0.5, 'Short score')
plt.figure(figsize=(12,5))
sns.scatterplot(x="Year",y="NNScore",data=short)
plt.title("Short score using NGram against the year")
plt.ylabel("Short score")
Text(0, 0.5, 'Short score')
Interestingly, when using the NGram dataset, which reflects real-life usage of these words, a trend emerges: there seem to be three distinct bands in the scatterplot. A more detailed discussion is included in the results.
plt.figure(figsize=(12,5))
plt.ylim((58,70))
sns.lineplot(x="Year",y="AAScore",data=short,hue="Outlet")
plt.ylabel("Short score using wordlist")
plt.title("Short scores of crosswords using wordlist over the years")
pass
sns.displot(data=short,x="Date",y="AAScore",aspect=0.8,height=5)
plt.ylim((40,80))
plt.ylabel("Short score")
plt.title("Short score in crosswords by wordlists throughout the years")
plt.xlabel("Year")
pass
sns.displot(data=short,x="Date",y="NNScore",aspect=0.8,height=5)
plt.ylim((0,8))
plt.ylabel("NGram short score")
plt.title("Amount of crosswordese in crosswords throughout the years by NGram")
plt.xlabel("Year")
Text(0.5, 6.79999999999999, 'Year')
plt.figure(figsize=(12,5))
sns.lineplot(data=short,x="Year",y="AAScore")
plt.ylim((50,80))
plt.ylabel("Short score")
plt.title("Short score in crosswords throughout the years")
pass
plt.figure(figsize=(12,5))
sns.lineplot(data=short,x="Year",y="NNScore",hue="Outlet",ci=None)
plt.ylabel("NGram short score")
plt.title("NGram short score in various outlet's crosswords throughout the years")
pass
We can see that there is a slight upward trend, showing improvement in accessibility in terms of the wordlist metrics. However, no such trend can be observed in the NGram data. The dip in 2000 can be explained by missing data.
The procedure for this question is similar to Q1, with the wordlists used for comparison. The wordlist score is used as-is. For my own freshness score of a long answer, I score it on a harmonic scale, with the i-th occurrence of an answer scoring 1/i; a crossword's score is then the sum over its long answers. Also, since long answers are rather rare in a crossword, their count is meaningful, so that is taken into account too.
Afterwards, these points are plotted to see whether any trends can be spotted.
long=crosswords[crosswords.Answers.str.len()>=8].copy() #copy the slice to avoid SettingWithCopyWarning
long.reset_index(inplace=True,drop=True)
long["Score"]=0
long["AScore"]=0
long["BScore"]=0
long
| Clues | Answers | Outlet | Date | Day | Year | Score | AScore | BScore | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Big name in video rentals | BLOCKBUSTER | New York Times | 1997-01-27 | 0 | 1997 | 0 | 0 | 0 |
| 1 | Cyclotron | ATOMSMASHER | New York Times | 1997-01-27 | 0 | 1997 | 0 | 0 | 0 |
| 2 | Exhausting task | BACKBREAKER | New York Times | 1997-01-27 | 0 | 1997 | 0 | 0 | 0 |
| 3 | Yegg | SAFECRACKER | New York Times | 1997-01-27 | 0 | 1997 | 0 | 0 | 0 |
| 4 | 'Little' extraterrestrials | GREENMEN | New York Times | 1997-01-28 | 1 | 1997 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 223495 | Amusing poser | BRAINTEASER | Wall Street Journal | 2022-07-06 | 2 | 2022 | 0 | 0 | 0 |
| 223496 | Bob Feller and Nolan Ryan, by reputation | FLAMETHROWERS | Wall Street Journal | 2022-07-06 | 2 | 2022 | 0 | 0 | 0 |
| 223497 | Feeding the hungry, say | ACTOFMERCY | Wall Street Journal | 2022-07-06 | 2 | 2022 | 0 | 0 | 0 |
| 223498 | One whose beliefs could use some rounding out? | FLATEARTHER | Wall Street Journal | 2022-07-06 | 2 | 2022 | 0 | 0 | 0 |
| 223499 | Station posting | TRAINSCHEDULE | Wall Street Journal | 2022-07-06 | 2 | 2022 | 0 | 0 | 0 |
223500 rows × 9 columns
counts={} #times each long answer has been seen so far (renamed from `dict`, which shadows the builtin)
def longScore(row):
    global counts
    ans=row.Answers
    try:
        row.Score+=1/(1+counts[ans]) #i-th occurrence scores 1/i
        counts[ans]+=1
    except KeyError: #first occurrence scores 1
        counts[ans]=1
        row.Score+=1
    return row
long=long.apply(longScore,axis=1)
long=long.apply(scoreWordlistA,axis=1)
long=long.apply(scoreWordlistB,axis=1)
long=long.groupby(["Outlet","Date","Day"])[["Score","AScore","BScore"]].agg(["sum","count"])
long=long.reset_index()
long.columns = [''.join(col) for col in long.columns]
long.drop(columns=["Scorecount","AScorecount"],inplace=True)
long.rename(columns={"Scoresum":"Score","AScoresum":"AScore","BScoresum":"BScore","BScorecount":"Count"},inplace=True)
long
| Outlet | Date | Day | Score | AScore | BScore | Count | |
|---|---|---|---|---|---|---|---|
| 0 | L.A. Times Daily | 2005-07-02 | 5 | 7.250000 | 405 | 360 | 9 |
| 1 | L.A. Times Daily | 2005-07-03 | 6 | 7.000000 | 565 | 370 | 10 |
| 2 | L.A. Times Daily | 2005-07-04 | 0 | 2.767857 | 385 | 350 | 7 |
| 3 | L.A. Times Daily | 2005-07-05 | 1 | 4.833333 | 339 | 250 | 6 |
| 4 | L.A. Times Daily | 2005-07-06 | 2 | 2.750000 | 290 | 220 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 28236 | Wall Street Journal | 2022-06-22 | 2 | 6.000000 | 96 | 170 | 6 |
| 28237 | Wall Street Journal | 2022-06-25 | 5 | 14.697169 | 1000 | 1000 | 26 |
| 28238 | Wall Street Journal | 2022-06-27 | 0 | 3.000000 | 255 | 180 | 4 |
| 28239 | Wall Street Journal | 2022-07-02 | 5 | 12.116667 | 360 | 350 | 16 |
| 28240 | Wall Street Journal | 2022-07-06 | 2 | 4.458333 | 388 | 110 | 6 |
28241 rows × 7 columns
long.Date=pd.to_datetime(long.Date)
long["Year"]=long["Date"].dt.strftime("%Y")
long["Year"]=long.Year.astype(int)
long["Month"]=long["Date"].dt.strftime("%m")
long.Month=long.Month.astype(int)
long.to_csv("longScore.csv")
plt.figure(figsize=(12,5))
sns.stripplot( y="Score", x="Day", data=long,
jitter=0.2, alpha=0.5)
plt.ylabel("Own Long Score")
plt.title("Own Long Score against day")
pass
plt.figure(figsize=(12,5))
sns.stripplot( y="AScore", x="Day", data=long,
jitter=0.2, alpha=0.5)
plt.ylabel("Wordlist Long Score")
plt.title("Wordlist Long Score against day")
Text(0.5, 1.0, 'Wordlist Long Score against day')
Through the week, we can see that the long score increases.
plt.figure(figsize=(12,5))
sns.displot(data=long,x="Date",y="AScore",aspect=0.8,height=7)
plt.ylim(top=1500)
plt.ylabel("Long Score")
plt.title("Displot of long score against time")
Text(0.5, 1.0, 'Displot of long score against time')
<Figure size 864x360 with 0 Axes>
The max of the long score seems to be increasing.
plt.figure(figsize=(12,5))
sns.lineplot(x="Year",y="AScore",data=long,hue="Outlet")
plt.ylabel("Long Score")
plt.title("Lineplot of long score of various outlet's crosswords over time")
Text(0.5, 1.0, "Lineplot of long score of various outlet's crosswords over time")
In general, we can see that the long score has been increasing, except for the Wall Street Journal, where it has been decreasing.
plt.figure(figsize=(12,5))
sns.boxplot(x="Month",y="AScore",data=long,hue="Outlet")
plt.ylabel("Long Score")
plt.title("Boxplot of long score against the month")
Text(0.5, 1.0, 'Boxplot of long score against the month')
There seems to be no correlation between month and the quality of crosswords. This is expected.
plt.figure(figsize=(12,5))
sns.boxenplot(y="AScore",data=long,x="Outlet")
plt.ylabel("Long Score")
plt.title("Boxenplot of long score against outlet")
Text(0.5, 1.0, 'Boxenplot of long score against outlet')
plt.figure(figsize=(12,5))
sns.stripplot(data=long,y="Count",x="Outlet")
plt.title("Number of long answers for each outlet")
Text(0.5, 1.0, 'Number of long answers for each outlet')
We can see that New York Times is the best for long answers, followed by LA Times and Wall Street Journal, then Universal and USA Today. In a similar fashion, New York Times has the most long answers, followed by Wall Street Journal, then LA Times, then Universal, then USA Today.
long["Year2"]=(long["Year"].astype(int))//5*5
plt.figure(figsize=(12,5))
sns.boxenplot(x="Year2",y="AScore",data=long)
plt.ylabel("Long Score")
plt.title("Boxenplot of long score against year")
Text(0.5, 1.0, 'Boxenplot of long score against year')
No significant trend can be seen with this boxplot against time.
plt.figure(figsize=(12,5))
sns.regplot(data=long,x="AScore",y="BScore")
plt.title("The scores of the two wordlists are generally similar")
pass
We can see that the long scores from the two wordlists are similar and do not differ much. It is thus reasonable to assume that changing the wordlist would not affect the results significantly, so our results are robust to the choice of wordlist.
In this section, I will be analysing how inclusive clues have been over the years. I will first concatenate all the words in the crossword, then, for each word, search for it in the name list. Each match scores one point, and we can then plot some trends. The two name lists used are taken from baby-name websites. Of course, this method is limited by the name lists; however, with 1000 names for each gender, it should be fairly robust.
sample=crosswords.copy()
sample["Words"]=sample["Clues"]+' '+sample["Answers"]
sample.Words=sample.Words.astype(str)
names=sample.groupby(['Outlet','Date'])['Words'].apply(' '.join).reset_index()
names["Words"]=names["Words"].str.replace("[\"_.“()”!:']+",regex=True,repl="")
names["Words"]=names["Words"].str.split("[ ]+",regex=True)
names
| | Outlet | Date | Words |
|---|---|---|---|
| 0 | L.A. Times Daily | 2005-07-02 | [A, Natural, Man, singer, RAWLS, Card, Players... |
| 1 | L.A. Times Daily | 2005-07-03 | [, from, New, York, show,, briefly, SNL, Got, ... |
| 2 | L.A. Times Daily | 2005-07-04 | [Fine, studies, ARTS, Not, guilty,, eg, PLEA, ... |
| 3 | L.A. Times Daily | 2005-07-05 | [What, a, relief, WHEW, Is, Born, ASTAR, Actor... |
| 4 | L.A. Times Daily | 2005-07-06 | [Dont, bother, SKIPIT, Even, speak, ASWE, Gran... |
| ... | ... | ... | ... |
| 28289 | Wall Street Journal | 2022-06-22 | [Toosie, Slide, rapper, DRAKE, AWOL, chasers, ... |
| 28290 | Wall Street Journal | 2022-06-25 | [Dude, BRO, Force, Behind, the, Forces, grp, U... |
| 28291 | Wall Street Journal | 2022-06-27 | [Winnie, Pu, first, Latin, bestseller, in, the... |
| 28292 | Wall Street Journal | 2022-07-02 | [2001, computer, HAL, Bravo, OLE, Can, I, get,... |
| 28293 | Wall Street Journal | 2022-07-06 | [Central, Park, in, the, Dark, composer, IVES,... |
28294 rows × 3 columns
gNames=list(val.lower() for val in gNames)
gNames=set(gNames)
def gNameSearch(row):
for word in row.Words:
if word.lower() in gNames:
row.GScore+=1
return row
bNames=list(val.lower() for val in bNames)
bNames=set(bNames)
def bNameSearch(row):
for word in row.Words:
if word.lower() in bNames:
row.BScore+=1
return row
names["BScore"]=0
names["GScore"]=0
names=names.apply(bNameSearch,axis=1)
names=names.apply(gNameSearch,axis=1)
names["Year"]=names["Date"].dt.strftime("%Y")
names["Year"]=names["Year"].astype(int)
plt.figure(figsize=(12,5))
names["Total"]=names["BScore"]+names["GScore"]
sns.lineplot(data=names[names.Outlet!="Wall Street Journal"],x="Year",y="Total",hue="Outlet",ci=None)
plt.ylabel("No. of Names")
plt.title("Occurrences of names in various outlet's crosswords over the years")
We can see that, in general, the number of names being used is decreasing. This may also be an effect of the accessibility push, making puzzles more about words than obscure celebrities. The Wall Street Journal was removed from this comparison, as its values were so high that they scaled the graph to be unreadable.
plt.figure(figsize=(12,5))
sns.lineplot(data=names,x="Year",y="BScore",color="b")
sns.lineplot(data=names,x="Year",y="GScore",color="pink")
plt.ylabel("Occurrences of Gendered Names")
plt.title("Occurrences of gendered names in crosswords over the years")
pass
plt.figure(figsize=(12,5))
sns.lineplot(data=names[names.Outlet=="Wall Street Journal"],x="Year",y="BScore",color="b")
sns.lineplot(data=names[names.Outlet=="Wall Street Journal"],x="Year",y="GScore",color="pink")
plt.ylabel("Occurrences of Gendered Names")
plt.title("Occurrences of gendered names in Wall Street Journal crosswords over the years")
pass
plt.figure(figsize=(12,5))
sns.lineplot(data=names[names.Outlet=="USA Today"],x="Year",y="BScore",color="b")
sns.lineplot(data=names[names.Outlet=="USA Today"],x="Year",y="GScore",color="pink")
plt.ylabel("Occurrences of Gendered Names")
plt.title("Occurrences of gendered names in USA Today crosswords over the years")
pass
The Wall Street Journal has also been steadily decreasing the number of names in its crosswords. In general, we find that outlets have been including more male names than female names. However, this is not true for USA Today: surprisingly, female names now appear more often than male names there. This interesting observation will be discussed in greater detail in the results section.
plt.figure(figsize=(12,5))
sns.boxplot(data=names,x="Outlet",y="Total")
plt.ylabel("Number of names")
plt.title("Number of names in crosswords per outlet")
Most outlets use few names in their crosswords, except for the Wall Street Journal, which uses them more often than the rest.
names
| | Outlet | Date | Words | BScore | GScore | Year | Total |
|---|---|---|---|---|---|---|---|
| 0 | L.A. Times Daily | 2005-07-02 | [A, Natural, Man, singer, RAWLS, Card, Players... | 4 | 2 | 2005 | 6 |
| 1 | L.A. Times Daily | 2005-07-03 | [, from, New, York, show,, briefly, SNL, Got, ... | 12 | 7 | 2005 | 19 |
| 2 | L.A. Times Daily | 2005-07-04 | [Fine, studies, ARTS, Not, guilty,, eg, PLEA, ... | 3 | 4 | 2005 | 7 |
| 3 | L.A. Times Daily | 2005-07-05 | [What, a, relief, WHEW, Is, Born, ASTAR, Actor... | 4 | 3 | 2005 | 7 |
| 4 | L.A. Times Daily | 2005-07-06 | [Dont, bother, SKIPIT, Even, speak, ASWE, Gran... | 5 | 4 | 2005 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 28289 | Wall Street Journal | 2022-06-22 | [Toosie, Slide, rapper, DRAKE, AWOL, chasers, ... | 2 | 4 | 2022 | 6 |
| 28290 | Wall Street Journal | 2022-06-25 | [Dude, BRO, Force, Behind, the, Forces, grp, U... | 6 | 5 | 2022 | 11 |
| 28291 | Wall Street Journal | 2022-06-27 | [Winnie, Pu, first, Latin, bestseller, in, the... | 3 | 1 | 2022 | 4 |
| 28292 | Wall Street Journal | 2022-07-02 | [2001, computer, HAL, Bravo, OLE, Can, I, get,... | 14 | 5 | 2022 | 19 |
| 28293 | Wall Street Journal | 2022-07-06 | [Central, Park, in, the, Dark, composer, IVES,... | 5 | 3 | 2022 | 8 |
28294 rows × 7 columns
Now, we combine the genders of the constructors with the NYT crosswords and do some analysis. This puts all the results together, making use of everything before: we will use the results from Q1, Q2 and Q3 to assist in our exploration. First, let us combine all the results into one dataframe.
short2=short.rename(columns={"AAScore":"AShort","BBScore":"BShort","NNScore":"NShort"})
short2.rename(columns={"AScore":"AShortSum","BScore":"BShortSum","Score":"NShortSum"},inplace=True)
long2=long.rename(columns={"AScore":"ALong","BScore":"BLong","Score":"NLong"})
names2=names.rename(columns={"BScore":"BNames","GScore":"GNames"})
short2.drop(columns=["Count","Day","Year","Month"],inplace=True)
long2.drop(columns=["Year"],inplace=True)
xwData=pd.merge(short2,long2,how="outer",on=["Outlet","Date"])
xwData=pd.merge(xwData,names2,how="outer",on=["Outlet","Date"])
xwData.to_csv("Q1Q2Q3.csv")
xwData=pd.read_csv("Q1Q2Q3.csv",index_col=0)
Also, I want to plot some trends involving Q1, Q2 and Q3, to assist in this question. Unfortunately, no visible trends can be spotted here.
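Since the pairplot shows no visible trends, a correlation matrix can back that up numerically. A minimal sketch on toy data standing in for `xwData`'s numeric columns (the column names match the real dataframe; the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for xwData's numeric columns (assumption: AShort, ALong, BNames, GNames)
demo = pd.DataFrame({
    "AShort": [60.1, 62.3, 58.9, 61.0],
    "ALong":  [310, 450, 280, 390],
    "BNames": [5, 8, 3, 6],
    "GNames": [4, 6, 2, 5],
})

# Pairwise correlations quantify what the pairplot shows visually;
# values near 0 would confirm the absence of trends between the metrics
corr = demo.corr()
print(corr.round(2))
```

On the real data, coefficients close to zero across the off-diagonal entries would confirm the "no visible trends" reading of the pairplot.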
sns.pairplot(data=xwData[["AShort","ALong","BNames","GNames"]])
For this question, we only have data from the NYT, hence we need to slice the dataframe and merge it with the constructor info.
xwData["Date"]=pd.to_datetime(xwData["Date"])
xwData.drop(columns=["Day"],inplace=True)
nytData=pd.merge(xwi,xwData[xwData.Outlet=="New York Times"],on="Date",how="inner")
nytData.drop(columns=["Outlet"],inplace=True) #we already know its from New York Times
nytData=nytData[nytData.Month.notna()] # we have some na entries
nytData["Month"]=nytData["Month"].astype(int)
nytData["Year"]=nytData["Year"].astype(int)
nytData["C1 No."]=nytData["C1 No."].astype(int,errors="ignore")
nytData.reset_index(inplace=True,drop=True)
nytData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8210 entries, 0 to 8209
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Day         8210 non-null   object
 1   Date        8210 non-null   datetime64[ns]
 2   C1 No.      8210 non-null   int32
 3   C1 Gender   8210 non-null   object
 4   C2 No.      8210 non-null   object
 5   C2 Gender   8210 non-null   object
 6   C3 No.      8210 non-null   object
 7   C3 Gender   8210 non-null   object
 8   NShortSum   8210 non-null   float64
 9   AShortSum   8210 non-null   int64
 10  BShortSum   8210 non-null   int64
 11  AShort      8210 non-null   float64
 12  BShort      8210 non-null   float64
 13  NShort      8210 non-null   float64
 14  NLong       8210 non-null   float64
 15  ALong       8210 non-null   float64
 16  BLong       8210 non-null   float64
 17  Count       8210 non-null   float64
 18  Month       8210 non-null   int32
 19  Year2       8210 non-null   float64
 20  Words       8210 non-null   object
 21  BNames      8210 non-null   int64
 22  GNames      8210 non-null   int64
 23  Year        8210 non-null   int32
 24  Total       8210 non-null   int64
dtypes: datetime64[ns](1), float64(9), int32(3), int64(5), object(7)
memory usage: 1.5+ MB
plt.figure(figsize=(12,5))
sns.countplot(data=nytData,x="Year",hue="C1 Gender")
plt.title("Number of crosswords constructed each year for the NYT, split by gender")
We can see that the crossword scene at the New York Times is primarily male-dominated.
daysOfWeek=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
plt.figure(figsize=(12,5))
sns.countplot(data=nytData,x="Day",hue="C1 Gender")
plt.title("Number of constructors for each day of the week, split by gender")
As the week goes on, more and more male constructors appear, and unfortunately, the number of female constructors decreases. The exception is Sunday, which has a difficulty similar to Wednesday/Thursday puzzles. One may infer that the tougher later-week difficulty discourages some female constructors from constructing.
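Because weekdays have unequal puzzle totals, a row-normalised crosstab makes the gender shares easier to compare than the raw counts in the countplot. A minimal sketch on toy data (assumption: the real dataframe is `nytData` with `Day` and `C1 Gender` columns; the rows below are invented):

```python
import pandas as pd

# Toy stand-in for nytData's Day and "C1 Gender" columns
demo = pd.DataFrame({
    "Day": ["Monday", "Monday", "Saturday", "Saturday", "Saturday"],
    "C1 Gender": ["F", "M", "M", "M", "F"],
})

# normalize="index" converts each row of counts into within-day proportions,
# so the female share on Monday can be compared directly with Saturday's
share = pd.crosstab(demo["Day"], demo["C1 Gender"], normalize="index")
print(share)
```

On the real data, a declining female share from Monday through Saturday would support the claim above more rigorously than eyeballing bar heights.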
sns.lineplot(data=nytData,x="Year",y="NShort",hue="C1 Gender")
plt.title("NGram short score against year, split by gender")
plt.ylabel("NGram short score")
sns.lineplot(data=nytData,x="Year",y="NLong",hue="C1 Gender")
plt.ylabel("Self-Long Score")
plt.title("Self-long score against year, split by gender")
sns.lineplot(data=nytData,x="Year",y="AShort",hue="C1 Gender")
plt.ylabel("Wordlist Short Score")
plt.title("Wordlist short score against year, split by gender")
sns.lineplot(data=nytData,x="Year",y="ALong",hue="C1 Gender")
plt.ylabel("Wordlist Long Score")
plt.title("Wordlist long score against year, split by gender")
We can see that the amount of crosswordese generally remains similar, but freshness is higher among men.
sns.boxplot(data=nytData,x="C1 Gender",y="AShort")
plt.ylim((40,65))
plt.ylabel("Wordlist Short Score")
plt.title("Boxplot of wordlist short score, split by gender")
Generally, using wordlists, there seems to be no difference in the amount of crosswordese between men and women.
sns.boxplot(data=nytData,x="C1 Gender",y="ALong")
plt.ylabel("Wordlist Long Score")
plt.title("Boxplot of wordlist long score, split by gender")
sns.boxplot(data=nytData,x="C1 Gender",y="NLong")
plt.ylabel("Own Metric Long Score")
plt.title("Boxplot of own metric long score, split by gender")
Using both wordlists and NGrams, men have a higher long-answer score than women.
sns.lineplot(data=nytData,y="BNames",x="Year",hue="C1 Gender")
plt.ylabel("No. of boy names")
plt.title("No. of boy names in NYT Crosswords over the years")
plt.ylim((3,8))
pass
sns.lineplot(data=nytData,y="GNames",x="Year",hue="C1 Gender")
plt.ylabel("No. of girl names")
plt.title("No. of girl names in NYT Crosswords over the years")
plt.ylim((3,8))
pass
Men and women use a similar number of boys' names. However, women are generally more inclined to use girls' names.
plt.figure(figsize=(12,5))
sns.scatterplot(data=nytData[nytData["C2 Gender"]=="-"],x="C1 No.",y="AShort")
plt.title("Short Score of constructors by experience")
plt.ylabel("Short Score")
plt.figure(figsize=(12,5))
sns.scatterplot(data=nytData[nytData["C2 Gender"]=="-"],x="C1 No.",y="ALong")
plt.title("Long Score of constructors by experience")
plt.ylabel("Long Score")
pass
From the two graphs, we can see that as a constructor makes more puzzles, their "worst" puzzle score increases, meaning that they become more consistent at making puzzles.
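The "rising floor" reading of the scatterplots can be checked by comparing a low quantile of the score between newer and seasoned constructors. A minimal sketch on toy data (assumption: `C1 No.` counts the constructor's prior NYT puzzles and `AShort` is the wordlist short score; the values below are invented):

```python
import pandas as pd

# Toy stand-in for nytData (columns as assumed above)
demo = pd.DataFrame({
    "C1 No.": [1, 2, 3, 15, 16, 17],
    "AShort": [50.0, 65.0, 62.0, 60.0, 61.0, 63.0],
})

# Compare the 10th percentile ("worst" typical puzzle) of new vs. seasoned constructors;
# a higher floor for veterans would indicate greater consistency
worst_new = demo.loc[demo["C1 No."] <= 10, "AShort"].quantile(0.1)
worst_vet = demo.loc[demo["C1 No."] > 10, "AShort"].quantile(0.1)
print(worst_new, worst_vet)
```

A low quantile is a more robust "floor" measure than the single minimum, which a one-off bad puzzle could dominate.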
plt.figure(figsize=(12,5))
sns.lineplot(data=nytData[(nytData["C1 No."]<=10)],x="Year",y="AShort")
sns.lineplot(data=nytData[(nytData["C1 No."]>10)],x="Year",y="AShort",color="g")
plt.ylabel("Short Score")
plt.title("Newer constructors have quite similar short scores to seasoned ones")
plt.figure(figsize=(12,5))
sns.lineplot(data=nytData[(nytData["C1 No."]<=10)],x="Year",y="ALong")
sns.lineplot(data=nytData[(nytData["C1 No."]>10)],x="Year",y="ALong",color="g")
plt.ylabel("Long Score")
plt.title("Newer constructors have higher long scores than seasoned ones")
plt.figure(figsize=(12,5))
nytData["C1 Bin"]=nytData["C1 No."]//10*10
sns.lineplot(data=nytData,x="C1 Bin",y="BNames")
sns.lineplot(data=nytData,x="C1 Bin",y="GNames",color="pink")
plt.title("Usage of names against constructor number")
plt.ylabel("Number of names")
plt.xlabel("Number of previous puzzles constructed")
As someone constructs more puzzles, the number of names they use decreases. This suggests that they want to make their puzzles more accessible. Additionally, the numbers of male and female names they use become more and more similar, suggesting that they may be striving to be more inclusive.
sns.displot(data=nytData,x="Date",y="C1 No.",height=5,aspect=2.4) # displot is figure-level; a preceding plt.figure would only create an empty figure
plt.ylabel("Number of previous puzzles constructed")
plt.title("Constructor number against year")
This graph shows that constructor number increases with time, which is expected. It also shows how constructors keep returning to the New York Times. However, a majority have only one puzzle in the New York Times, as seen in the more darkly coloured section close to 0.
sns.displot(data=nytData,x="Day",y="C1 No.",row_order=daysOfWeek,height=8) # figure-level plot; no separate plt.figure needed
plt.ylabel("Number of previous puzzles constructed")
plt.title("Constructor number against the day of week they get published")
sns.displot(data=nytData[nytData["C1 No."]<10],x="Day",y="C1 No.",height=8,row_order=daysOfWeek) # figure-level plot; no separate plt.figure needed
plt.ylabel("Number of previous puzzles constructed")
plt.title("Constructor number against the day of week they get published")
Unfortunately, seaborn appears to have a limitation that prevents me from ordering the rows properly. In general, we find that as the week goes from Monday to Saturday, the number of new constructors decreases. The anomaly here is the Sunday puzzle.
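One workaround for the ordering issue (`row_order` orders facet rows, not the x-axis categories) is to convert `Day` to an ordered `Categorical` before plotting, so seaborn and pandas respect weekday order. A minimal sketch on toy data (assumption: the real column is `nytData["Day"]`):

```python
import pandas as pd

daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Toy stand-in for nytData's Day column
demo = pd.DataFrame({"Day": ["Sunday", "Monday", "Friday", "Monday"]})

# An ordered Categorical fixes the category order for any downstream plot or groupby
demo["Day"] = pd.Categorical(demo["Day"], categories=daysOfWeek, ordered=True)
counts = demo["Day"].value_counts(sort=False)
print(counts)
```

Note that `value_counts` on a categorical column also reports zero counts for unobserved weekdays, which keeps the plot axes complete.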
collab=nytData[nytData["C2 Gender"]!="-"].copy() # .copy() avoids the SettingWithCopyWarning on the next line
collab["C2 No."]=collab["C2 No."].astype(int)
collab
| | Day | Date | C1 No. | C1 Gender | C2 No. | C2 Gender | C3 No. | C3 Gender | NShortSum | AShortSum | ... | BLong | Count | Month | Year2 | Words | BNames | GNames | Year | Total | C1 Bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Saturday | 1997-03-29 | 15 | F | 126 | M | - | - | 6.491294 | 3543 | ... | 470.0 | 12.0 | 3 | 1995.0 | ['A', 'kingdom', 'for', 'Henry', 'V', 'ASTAGE'... | 6 | 2 | 1997 | 8 | 10 |
| 190 | Friday | 1997-08-08 | 20 | M | 28 | M | - | - | 6.066707 | 3418 | ... | 450.0 | 12.0 | 8 | 1995.0 | ['Dagnabbit', 'NERTS', 'Enough', 'STOPIT', 'Ki... | 3 | 3 | 1997 | 6 | 20 |
| 296 | Sunday | 1997-11-23 | 8 | M | 1 | M | - | - | 19.031943 | 7435 | ... | 560.0 | 14.0 | 11 | 1995.0 | ['two', 'mints', 'INONE', 'Dallas', 'Miss', 'E... | 16 | 13 | 1997 | 29 | 0 |
| 308 | Friday | 1997-12-05 | 1 | M | 4 | M | - | - | 229.045053 | 3345 | ... | 500.0 | 14.0 | 12 | 1995.0 | ['Primary', 'Colors', 'author,', 'for', 'short... | 5 | 3 | 1997 | 8 | 0 |
| 312 | Tuesday | 1997-12-09 | 16 | F | 127 | M | - | - | 6.758687 | 4398 | ... | 100.0 | 5.0 | 12 | 1995.0 | ['Casablanca', 'role', 'RICK', 'OK,', 'why', '... | 3 | 1 | 1997 | 4 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8196 | Monday | 2022-06-13 | 3 | M | 2 | F | - | - | 22.822379 | 4761 | ... | 90.0 | 2.0 | 6 | 2020.0 | ['', 'spoon', 'fork?', 'ORA', 'Ah,', 'that', '... | 3 | 7 | 2022 | 10 | 0 |
| 8199 | Thursday | 2022-06-16 | 1 | M | 51 | M | - | - | 47.613915 | 4342 | ... | 230.0 | 9.0 | 6 | 2020.0 | ['', 'but', 'it', 'seems', 'like', 'you', 'hat... | 3 | 2 | 2022 | 5 | 0 |
| 8204 | Friday | 2022-06-24 | 8 | F | 4 | F | - | - | 7.066982 | 3573 | ... | 510.0 | 11.0 | 6 | 2020.0 | ['Stronger', 'than', 'pain', 'sloganeer', 'ADV... | 8 | 5 | 2022 | 13 | 0 |
| 8206 | Sunday | 2022-06-26 | 14 | M | 21 | M | - | - | 27.225994 | 7711 | ... | 420.0 | 16.0 | 6 | 2020.0 | ['Despicable', 'Me', 'antihero', 'GRU', 'Hairs... | 9 | 6 | 2022 | 15 | 10 |
| 8209 | Thursday | 2022-06-30 | 35 | M | 54 | M | - | - | 35.286622 | 4499 | ... | 0.0 | 1.0 | 6 | 2020.0 | [',', 'in', 'emails', 'URGENT', 'but', 'perhap... | 8 | 5 | 2022 | 13 | 30 |
680 rows × 26 columns
sns.scatterplot(data=collab,x="C1 No.",y="C2 No.")
plt.xlabel("Puzzle of the 1st constructor")
plt.ylabel("Puzzle of the 2nd constructor")
plt.title("Scatterplot of the relationship of the puzzle number of collaborators")
There seems to be no correlation between who collaborates with whom. However, we can see clear clustering of points along the x- and y-axes. This suggests that collaborations are mostly used to induct new constructors into the New York Times crossword.
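The clustering along the axes can be quantified as the share of collaborations in which at least one partner is a near-debut constructor. A minimal sketch on toy data (assumption: the real dataframe is `collab` with `C1 No.`/`C2 No.` puzzle counts; the rows below are invented):

```python
import pandas as pd

# Toy stand-in for `collab`
demo = pd.DataFrame({
    "C1 No.": [1, 2, 40, 20, 3],
    "C2 No.": [35, 1, 2, 30, 50],
})

# For each collaboration, take the less-experienced partner's puzzle count;
# the share at or below a small threshold measures the clustering near the axes
near_axis = (demo[["C1 No.", "C2 No."]].min(axis=1) <= 3).mean()
print(near_axis)
```

A high share on the real data would support the induction reading: most collaborations pair an established constructor with a newcomer.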
sns.histplot(data=collab,x="Year",color="b",bins=range(1997,2023))
plt.title("The number of collaborations has been increasing year on year")
plt.ylabel("Number of collaborations")
While the number of daily crosswords has remained similar throughout the years, the number of collaborations has increased. This suggests more openness within the community to induct new people, and a greater sense of community.
nytData["C1C2 Gender"]=nytData["C1 Gender"]+nytData["C2 Gender"]
plt.figure(figsize=(16,5))
sns.countplot(data=nytData,x="Year",hue="C1C2 Gender")
plt.title("Number of puzzles published by different pairs of constructors over the years")
collab["C1C2 Gender"]=collab["C1 Gender"]+collab["C2 Gender"]
plt.figure(figsize=(16,5))
sns.countplot(data=collab,x="Year",hue="C1C2 Gender")
plt.title("Number of puzzles published by different pairs of constructors over the years")
No clear trends can be seen between the genders of collaborators.
sns.lineplot(data=nytData,x="Year",y="AShort")
sns.lineplot(data=collab,x="Year",y="AShort",color="g")
plt.ylabel("Short Score")
plt.title("Short score of crosswords over time, split by collaborations")
sns.lineplot(data=nytData,x="Year",y="ALong")
sns.lineplot(data=collab,x="Year",y="ALong",color="g")
plt.ylabel("Long Score")
plt.title("Long score of crosswords over time, split by collaborations")
sns.lineplot(data=nytData[(nytData["C1 No."]<10)],x="Year",y="AShort")
sns.lineplot(data=collab[(collab["C1 No."]<10) | (collab["C2 No."]<10)],x="Year",y="AShort",color="g")
plt.ylabel("Short Score")
plt.title("Short score of crosswords over time by newer constructors, split by collaborations")
sns.lineplot(data=nytData[(nytData["C1 No."]<10)],x="Year",y="ALong")
sns.lineplot(data=collab[(collab["C1 No."]<10) | (collab["C2 No."]<10)],x="Year",y="ALong",color="g")
plt.ylabel("Long Score")
plt.title("Long score of crosswords by newer constructors over time, split by collaborations")
Collaborations seem to produce similar amounts of crosswordese while increasing the freshness of puzzles, which is an overall gain. This is especially visible with newer constructors, which is a benefit, as it makes it easier for them to be accepted by the New York Times.
For this section, Day 0 refers to Monday, going through the week until Day 6, Sunday.
For this section, I define the short score as how "good" entries with 7 letters or fewer are, using as a metric either the scoring given by wordlist makers or their frequency on Google NGrams. The higher the score, the better. Recall that crosswordese is defined as the amount of obscure fill: the higher the score, the less the crosswordese.
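The short-score idea above can be sketched in a few lines. This is a hedged illustration, not the exact scoring pipeline: the tiny `wordlist` dict and its values are invented stand-ins for a real wordlist's answer-to-quality mapping.

```python
# Hypothetical wordlist: answer -> quality score (higher = less crosswordese-y).
# QINTAR stands in for an obscure "crosswordese" entry with a low score.
wordlist = {"ERA": 40, "OREO": 60, "AREA": 55, "QINTAR": 10}

answers = ["ERA", "OREO", "QINTAR"]

# Short score: mean quality of all entries with 7 letters or fewer,
# with unknown answers defaulting to 0 (assumed penalty for obscurity)
short = [a for a in answers if len(a) <= 7]
short_score = sum(wordlist.get(a, 0) for a in short) / len(short)
print(short_score)
```

Averaging (rather than summing) keeps the score comparable across grids with different entry counts.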
nrow=1
ncol=2
fig = plt.figure(figsize=(16,5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.5)
axes = gs.subplots(sharex=True, sharey=False)
sns.stripplot(ax=axes[0],x="Day",y="AAScore",data=short,alpha=0.01)
sns.boxplot(ax=axes[0],x="Day",y="AAScore",data=short)
axes[0].set_title("Short score using wordlist A against the day of week")
axes[0].set_ylim((40,80))
axes[0].set_ylabel("Wordlist Short Score")
sns.stripplot(ax=axes[1],x="Day",y="NNScore",data=short)
axes[1].set_title("Short score using NGram against the day of week")
axes[1].set_ylabel("NGram Short Score")
plt.suptitle("Short Scores of crosswords, scored using wordlists and NGram, split by day")
pass
From the wordlist scoring, we can see that the average score of crosswords decreases throughout the week, albeit minimally. This may signal the puzzle becoming more and more inaccessible to solvers, a phenomenon caused by editors wishing to cater to crossword buffs. The wordlist boxplot shows them trying to cater to both newbies and seasoned puzzlers, by making early-week puzzles very accessible and later-week ones harder. This is a reasonable compromise.
From the NGram score, we can clearly observe three distinct sections in the data. A possible explanation is that the NGram dataset is sparser, but I suspect that is not the case. It may be that crosswords are simply inaccessible to the general public outside this community, which may explain why crossword scores are higher when the person grading them is a crossword maker themselves, who understands the constraints. Hence, the wordlist scores cluster closely together, as their scoring is rather similar, whilst the NGram scores are far more spread out, since they represent real-world usage, which differs greatly from crossword fill.
nrow=1
ncol=2
# make a list of all dataframes
fig = plt.figure(figsize=(16,5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)
axes[0].set_ylim((58,70))
sns.lineplot(ax=axes[0],x="Year",y="AAScore",data=short,hue="Outlet")
axes[0].set_ylabel("Short score using wordlist")
axes[0].set_title("Short scores of crosswords using wordlist over the years")
sns.lineplot(ax=axes[1],x="Year",y="NNScore",data=short,hue="Outlet",ci=False)
axes[1].set_ylabel("Short score using NGram")
axes[1].set_title("Short scores of crosswords using NGram over the years")
plt.suptitle("Short scores of crosswords over the years, split by outlet")
From the wordlist graph, we can generally infer that, by wordlist standards, every outlet's score is increasing. This is some evidence that the amount of crosswordese has decreased throughout the years, which would mean greater accessibility for the average solver, who knows at least some crosswordese.
The NGram graph is more erratic, with only USA Today having seen a significant increase. This suggests that a total novice, someone who has never seen a puzzle before, could still have a much harder time. A moderate amount of knowledge is needed to break into solving crosswords, but that skill floor has decreased over the years.
The interesting outlier here is USA Today, and the outlier is deliberate. USA Today touts its crossword as one of the easier, more beginner-friendly puzzles. Indeed, the data seems to bear that out, with a great improvement in accessibility in recent years. This big jump does not imply superiority over other outlets.
Instead, it showcases a compromise they have made. Traditionally, crossword grids are rotationally or reflectionally symmetrical. In recent years, however, USA Today has given up on symmetry in favour of better fill and less crosswordese. This works as a compromise: less elegant, but able to induct more solvers into the crossword universe, an overall positive gain.
For this section, I define the long score as a metric for freshness: the higher the long score, the fresher the puzzle. Higher freshness is better.
nrow=1
ncol=2
# make a list of all dataframes
fig = plt.figure(figsize=(16,5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)
sns.stripplot( y="Score", x="Day", data=long,
jitter=0.2, alpha=0.2,ax=axes[0])
axes[0].set_ylabel("Own long Score")
axes[0].set_title("Own long Score against day")
sns.stripplot( y="AScore", x="Day", data=long,
jitter=0.2, alpha=0.2,ax=axes[1])
axes[1].set_ylabel("Wordlist long Score")
axes[1].set_title("Wordlist long Score against day")
plt.suptitle("Long score of crosswords, split by day of week and metric used")
pass
As the week goes on, the long score gets higher and higher. Obviously, Sunday puzzles will have a higher long score, given their extra grid space. For the other days of the week, however, a different explanation is required. A possible reason is the increasing difficulty of crosswords through the week, especially Friday and Saturday, which are themeless puzzles for some outlets. Unrestricted by any theme, all their long answers must shine, and there must be more of them, since there is more freedom in gridding. This can be seen using both our own metric and the wordlist, confirming the results. What this means is that, even though later-week puzzles are less accessible for newbie solvers, veterans will be satisfied to know that many snazzy answers await them.
plt.figure(figsize=(12,5))
sns.lineplot(x="Year",y="AScore",data=long,hue="Outlet")
plt.ylabel("Long Score")
plt.title("Lineplot of long score of various outlet's crosswords over time")
Based on this line chart alone, we can tell that the New York Times is the best place for fresh fill. This may explain why it is said to be "the gold standard". For some background, they pay the highest rates in the industry, about 500 USD per puzzle. This potential profit draws many constructors and means the NYT receives more submissions. They then have the luxury of selecting only the best, which gives them a competitive edge over the other outlets. While most other outlets have generally not seen their freshness change, the New York Times certainly has, sitting above the rest here.
The increasing trend shows that constructors are constantly raising the bar on what they can do and how fresh the puzzles are. With the rise of computer construction software, it has never been easier to construct crosswords. Trial and error is no longer required, and constructors can focus solely on making their crossword the best it can be. In my opinion, this graph does reflect such a shift.
The Wall Street Journal's decline may be explained by the fact that they run fewer Sunday crosswords now, which does affect their long score. Nowadays, however, it matches most other outlets.
nrow=2
ncol=3
# make a list of all dataframes
fig = plt.figure(figsize=(16,4))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
# plot counter
import matplotlib.patches as mpatches
outlets=["New York Times","L.A. Times Daily","USA Today","Universal","Wall Street Journal"]
count=0
for r in range(nrow):
for c in range(ncol):
if count==5:
sns.lineplot(ax=axes[1,2],data=names,x="Year",y="BScore",color="b")
sns.lineplot(ax=axes[1,2],data=names,x="Year",y="GScore",color="pink")
axes[1,2].set_title("All Crosswords")
else:
sns.lineplot(ax=axes[r,c],data=names[names.Outlet==outlets[count]],x="Year",y="BScore",color="b")
sns.lineplot(ax=axes[r,c],data=names[names.Outlet==outlets[count]],x="Year",y="GScore",color="pink")
axes[r,c].set_title(outlets[count])
count+=1
axes[r,c].set_ylabel("Occurrences")
axes[0,2].legend(handles=[mpatches.Patch(color='b'),mpatches.Patch(color='pink')],labels=["Male","Female"])
plt.suptitle("Occurrences of gendered names in crosswords over the years")
This graph shows the 5 major outlets and how their occurrences of gendered names have changed throughout the years. The New York Times, L.A. Times Daily and Wall Street Journal still use more male names than female names in their crosswords, whilst Universal and USA Today seem to be closing the gap. This change is probably deliberate: with editors at the helm who aim to be more inclusive, these crosswords want to reflect society more fully and attract more women to the puzzle.
Even though it is less obvious, the New York Times also seems to be trying to be more inclusive, with a dip in the number of male names used. This may be due to other factors, which will be discussed in question 4.
On the full scale, the numbers of gendered names seem to be converging, with more female ones and fewer male ones. This should be celebrated, as it reflects a change in perception of the crossword. Since the crossword somewhat reflects who made the puzzle, the greater balance in names shows how the crossword is getting more diverse. No longer is the crossword only for men; now more people can see themselves in it. It does influence how someone feels when they see something they identify with, rather than a baseball team.
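The convergence claim can be expressed as a yearly girls'-to-boys' name ratio, with values approaching 1 indicating balance. A minimal sketch on toy data (assumption: the real dataframe is `names` with `BScore` for boys' names and `GScore` for girls' names per puzzle; the values below are invented):

```python
import pandas as pd

# Toy stand-in for `names`
demo = pd.DataFrame({
    "Year":   [2005, 2005, 2022, 2022],
    "BScore": [10, 8, 6, 7],
    "GScore": [4, 5, 6, 6],
})

# Sum name counts per year, then take the girls'-to-boys' ratio;
# a ratio rising toward 1 over time would show the gap closing
yearly = demo.groupby("Year")[["BScore", "GScore"]].sum()
ratio = yearly["GScore"] / yearly["BScore"]
print(ratio)
```

A single ratio per year condenses the two lineplot curves into one convergence measure that is easy to track over time.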
plt.figure(figsize=(12,5))
sns.boxplot(data=names,x="Outlet",y="Total")
plt.ylabel("Number of names")
plt.title("Number of names in crosswords per outlet")
pass
It is also interesting to note that most outlets use roughly the same number of names. The clear outlier here is the Wall Street Journal. With more names being used, it may be harder for someone who does not recognise them to solve the puzzle, which circles back to the issue of crosswordese. Since the median number of names per puzzle hovers around 8-9, and higher still for the Wall Street Journal, it is all the more important that crosswords contain a diverse set of names, so that no one group of people feels left out. Names are not just another clue; they have the power to make us connect and feel things.
On a side note, the Wall Street Journal figures may be skewed by the fact that the outlet publishes more Sunday crosswords, which are bigger and hence may contain more names. Even if so, it probably would not deviate much from the general trend of the other outlets.
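One way to probe this side note would be to compare name counts on Sundays against the rest of the week. A minimal sketch on made-up data (the real `names` frame may not carry a `Day` column; the column names and numbers below are hypothetical):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real `names` data.
names_demo = pd.DataFrame({
    "Outlet": ["Wall Street Journal"] * 6,
    "Day":    ["Sunday", "Monday", "Sunday", "Friday", "Sunday", "Tuesday"],
    "Total":  [14, 9, 15, 10, 13, 8],
})

wsj = names_demo[names_demo["Outlet"] == "Wall Street Journal"]
# Median name count, grouped by whether the puzzle ran on a Sunday.
by_size = wsj.groupby(wsj["Day"] == "Sunday")["Total"].median()
print(by_size)
```

If the Sunday median is clearly higher, the larger Sunday grid is likely inflating the outlet's name counts rather than any editorial choice.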
#sns.countplot(data=nytData,x="Year",hue="C1 Gender")
pivot=pd.concat([nytData.groupby("Year")["C1 Gender"].count()
,nytData[(nytData["C1 Gender"]!="F") & (nytData["C2 Gender"]!="F")].groupby("Year")["C1 Gender"].count()
,nytData[(nytData["C1 Gender"]=="F") | (nytData["C2 Gender"]=="F")].groupby("Year")["C1 Gender"].count()],axis=1)
pivot.columns=["Total","Male","Female"]
plt.figure(figsize=(17,5))
plt.bar([x for x in range(1997,2023) if x!=2000], pivot["Male"]/pivot["Total"], label="Male") #plot the bottom most bar first
plt.bar([x for x in range(1997,2023) if x!=2000], pivot["Female"]/pivot["Total"], label="Female", bottom=pivot["Male"]/pivot["Total"])
#missing data in 2000
plt.title("The NYT Crossword is still very male dominated")
plt.ylabel("Relative frequency")
plt.xlabel("Year")
plt.legend()
pass
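The two masks in the cell above split puzzles into all-male teams and teams with at least one female constructor. Since the conditions are logical complements, every puzzle lands in exactly one group, so the two counts should sum to the total; a quick sanity check on synthetic data (column names mirror the real frame, the rows are made up):

```python
import pandas as pd

# Tiny synthetic stand-in for nytData: constructor 1/2 genders per puzzle.
df = pd.DataFrame({
    "C1 Gender": ["M", "F", "M", "M"],
    "C2 Gender": [None, None, "F", "M"],
})

# Same masks as above: no female constructor vs. at least one.
all_male = df[(df["C1 Gender"] != "F") & (df["C2 Gender"] != "F")]
any_female = df[(df["C1 Gender"] == "F") | (df["C2 Gender"] == "F")]

# The masks are complements, so the group sizes sum to the total.
assert len(all_male) + len(any_female) == len(df)
print(len(all_male), len(any_female))  # 2 2
```

Note that a missing `C2 Gender` (solo constructor) compares as not-"F", so solo male constructors correctly fall into the all-male group.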
pivot=pd.concat([nytData.groupby("Day")["C1 Gender"].count()
,nytData[(nytData["C1 Gender"]!="F") & (nytData["C2 Gender"]!="F")].groupby("Day")["C1 Gender"].count()
,nytData[(nytData["C1 Gender"]=="F") | (nytData["C2 Gender"]=="F")].groupby("Day")["C1 Gender"].count()],axis=1)
pivot.columns=["Total","Male","Female"]
pivot.sort_values("Male",ascending=True,inplace=True)
plt.figure(figsize=(17,5))
plt.bar(pivot.index, pivot["Male"]/pivot["Total"], label="Male")
plt.bar(pivot.index, pivot["Female"]/pivot["Total"], label="Female", bottom=pivot["Male"]/pivot["Total"])
plt.title("The NYT Crossword is still very male dominated")
plt.ylabel("Relative frequency")
plt.xlabel("Day of week")
plt.legend()
pass
This first graph shows that the NYT is still very much male-dominated in terms of constructors: even though the number of female constructors has been higher since 2020 than in the years before, most crosswords are still made by men.
The problem is exacerbated when the trend is split by day of the week. Firstly, notice that the x-axis largely follows the order of the days of the week, except for Sunday. Since the difficulty of the puzzle increases through the week, that is one possible reason why fewer women construct for the later days. As discussed previously, the later days of the week are for crossword buffs, a rather gated community, so women may be less inclined to construct for them; the earlier, more accessible days are a more natural choice.
This is not to discount them as constructors. External factors, even simple personal preference, may well explain why they construct those puzzles less. It is still a problem, however: a potential male bias in the fill towards later days makes an already hard puzzle even more inaccessible for one gender.
nrow=1
ncol=2
fig = plt.figure(figsize=(16,5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)
sns.lineplot(data=nytData,y="BNames",x="Year",hue="C1 Gender",ax=axes[0])
axes[0].set_ylabel("No. of boy names")
axes[0].set_title("Number of male names in NYT crosswords")
axes[0].set_ylim((3,8))
sns.lineplot(data=nytData,y="GNames",x="Year",hue="C1 Gender",ax=axes[1])
axes[1].set_ylabel("No. of girl names")
axes[1].set_title("Number of female names in NYT crosswords")
axes[1].set_ylim((3,8))
This graph illustrates the importance of diversity in construction. Firstly, male names are used more than female names; since USA Today has shown that more female names can be used, this is not ideal. More importantly, it shows something about construction itself: female constructors tend to use more female names than male constructors do! Having grown up female, their life experiences, idols and interests are different. They bring a piece of their personality into the puzzle, incorporating what male constructors may not characterise as common knowledge. While male constructors have been decreasing the number of names they use, female constructors have been capitalising on names, integrating more of their character.
pivot=pd.pivot_table(data=nytData,index="C1 No.",columns="Day",values="Date",aggfunc="count")
pivot.fillna(0,inplace=True)
pivot=pivot.iloc[:10]
plt.figure(figsize=(9,7))
sns.heatmap(pivot[daysOfWeek], cmap="YlGnBu",linewidths=.5)
plt.ylabel("Constructor's nth published puzzle (first 10)")
plt.xlabel("Day of week")
plt.title("Heatmap of which puzzles newer constructors construct")
pass
We can see that fewer and fewer newer constructors make the puzzle as the week goes on. This may be caused by the relative difficulty of constructing such puzzles, which is known to increase through the week. The anomaly of Sunday can be explained by its difficulty being closer to a Wednesday/Thursday puzzle. As solving difficulty increases, so does construction difficulty. Since newer constructors have a higher chance of being rejected, making an early-week puzzle is safer; Jeff Chen of XWordInfo, a crossword site, recommends that newcomers not dive straight into constructing late-week puzzles. As discussed before, this low density of newer constructors may also stem from the NYT being the gold standard, whose high bar is very difficult to meet immediately.
nrow=1
ncol=2
fig = plt.figure(figsize=(16,5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)
sns.lineplot(ax=axes[1],data=nytData[(nytData["C1 No."]<=10)],x="Year",y="ALong",color="b")
sns.lineplot(ax=axes[1],data=nytData[(nytData["C1 No."]>10)],x="Year",y="ALong",color="g")
axes[1].set_ylabel("Long Score")
axes[1].set_title("Newer constructors have lower long scores than seasoned ones")
sns.lineplot(ax=axes[0],data=nytData[(nytData["C1 No."]<=10)],x="Year",y="AShort",color="b")
sns.lineplot(ax=axes[0],data=nytData[(nytData["C1 No."]>10)],x="Year",y="AShort",color="g")
axes[0].set_ylabel("Short Score")
axes[0].set_title("Newer constructors have short scores similar to seasoned ones")
axes[1].legend(handles=[mpatches.Patch(color='b'),mpatches.Patch(color='g')],labels=["Newer","Seasoned"])
pass
Newer constructors do have a slight disadvantage when it comes to freshness, but that is to be expected, as they are newer to the craft. These trends are unsurprising, and they justify why it is hard for newer constructors to get accepted: they face stiff competition. That is not to say the NYT does not accept them; it still tries to aid newer constructors.
plt.figure(figsize=(12,5))
ax=sns.histplot(data=collab,x="Year",color="b",bins=range(1997,2023))
plt.title("Number of collaborations has been increasing year on year")
plt.ylabel("Number of collaborations")
labels = [str(v) if v else '' for v in ax.containers[0].datavalues]
ax.bar_label(ax.containers[0], labels=labels)
pass
As the years have gone by, the number of collaborations has increased. This may be due to better communication and a community that generally interacts more. More collaborations can benefit everyone. Constructors pool a wider range of experiences, producing a more diverse puzzle, one for one and all; solvers get a more robust puzzle, elevating their solving experience.
nrow=2
ncol=2
fig = plt.figure(figsize=(16,5))
gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0.2)
axes = gs.subplots(sharex=True, sharey=False)
sns.lineplot(data=nytData,x="Year",y="AShort",ax=axes[0,0])
sns.lineplot(data=collab,x="Year",y="AShort",color="g",ax=axes[0,0])
axes[0,0].set_ylabel("Short Score")
axes[0,0].set_title("Short score of crosswords over time, split by collaborations")
sns.lineplot(data=nytData,x="Year",y="ALong",ax=axes[1,0])
sns.lineplot(data=collab,x="Year",y="ALong",color="g",ax=axes[1,0])
axes[1,0].set_ylabel("Long Score")
axes[1,0].set_title("Long score of crosswords over time, split by collaborations")
sns.lineplot(data=nytData[(nytData["C1 No."]<10)],x="Year",y="AShort",ax=axes[0,1])
sns.lineplot(data=collab[(collab["C1 No."]<10) | (collab["C2 No."]<10)],x="Year",y="AShort",color="g",ax=axes[0,1])
axes[0,1].set_ylabel("Short Score")
axes[0,1].set_title("Short score of crosswords by newer constructors over time, split by collaborations")
sns.lineplot(data=nytData[(nytData["C1 No."]<10)],x="Year",y="ALong",ax=axes[1,1])
sns.lineplot(data=collab[(collab["C1 No."]<10) | (collab["C2 No."]<10)],x="Year",y="ALong",color="g",ax=axes[1,1])
axes[1,1].set_ylabel("Long Score")
axes[1,1].set_title("Long score of crosswords by newer constructors over time, split by collaborations")
axes[0,0].set_ylim((55,65))
axes[0,1].set_ylim((55,65))
axes[1,0].set_ylim((200,700))
axes[1,1].set_ylim((200,700))
#generating the four corners of the small multiple
plt.suptitle("How collaborations affect crossword quality")
pass
Overall, for both seasoned and new constructors, collaborations do not change the amount of crosswordese. This may be because crosswordese is a necessary evil for "good" crossword puzzles and hence unavoidable. What the constructor can control is the long score, the freshness factor. For both types of constructors, the freshness of the fill slightly increases when they collaborate. Although the increase is not large, consider how the answers are scored: on the wordlist scale used here, an increase of just 10 already signifies a "better", trendier answer. Hence, any sort of improvement helps.
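For context on that scale: scoring against a rated wordlist amounts to a dictionary lookup with a fallback for unlisted answers. A minimal sketch, where the words, scores and default value are all made up for illustration (the real wordlist differs):

```python
# Hypothetical rated wordlist: higher score = fresher, livelier fill.
wordlist = {"ETUI": 20, "OREO": 45, "YEETED": 58}
DEFAULT = 50  # assumed fallback score for answers not in the list

def answer_score(answer: str) -> int:
    """Look up an answer's freshness rating, falling back to DEFAULT."""
    return wordlist.get(answer.upper(), DEFAULT)

# Average score of a (tiny) grid's answers: a +10 shift on this scale
# already marks a clearly trendier fill.
grid = ["etui", "oreo", "yeeted"]
avg = sum(answer_score(a) for a in grid) / len(grid)
print(avg)  # (20 + 45 + 58) / 3 = 41.0
```

On a scale like this, "ETUI" scoring 20 versus "YEETED" scoring 58 is exactly the crosswordese-versus-freshness gap the long score tries to capture.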
Recommendations:
For crosswordese, not much can be done; it seems pretty much set in stone. For freshness, we can and should try to do better, with the aid of computer programs. The current trend shows that more inclusivity is needed. Various programs provide valuable mentorship to constructors and help them debut; the most help is needed for late-week crosswords. More needs to be done so that our crosswords better reflect who is solving them. Some outlets have been seeing a shift: the L.A. Times Daily, for example, has been increasing the share of puzzles by female constructors, trying to keep it above 50%. Such actions can hopefully bring more marginalised groups into crosswords, making it a better community. Of course, some outlets, like the New York Times, cannot simply do that. They should nonetheless try their best to help marginalised constructors polish crosswords that are not quite up to par, to give them a chance. Even though this idea is somewhat unfair, it also serves as a form of affirmative action.
Limitations:
This project has largely relied on my own transformed data. Even though I have made best efforts to justify why and how I transformed the data, it is still not perfect. Much of the data used in this project was transformed from original sources, so some of it may be inaccurate. That said, some of the findings here largely match what has been found elsewhere, so they can generally be considered reliable. One concrete limitation is that scoring long words on my own scale was rather subjective. Additionally, the only dataset I could find for Q4 was from the New York Times, which may have biased the results, and the findings may not apply to other outlets.
Future work
Further exploration can be done on Q4 for the other outlets, as I am missing that data. A more concrete and objective metric could also be used, as my methods are imperfect. As more crossword types and formats appear, a similar project could be attempted on other variations of the crossword; a famous and interesting example would be the cryptic crossword, mostly played in the UK.
An appendix is included along with this project, which contains all the web scraping code.
Dataset references